Module 1 - Lab 1
From Raw Text to NLP Pipelines (SEC 10-K)
Lab Objective
In this lab, you will:
- Connect Google Colab to VS Code
- Load real-world corporate text data (SEC 10-K filings)
- Implement a classical NLP preprocessing pipeline
- Answer exploratory questions about corporate disclosures using text analytics
This lab establishes the computational and conceptual foundation for later work with embeddings and generative models.
Background Context
Public companies file Form 10-K annually with the U.S. Securities and Exchange Commission (SEC).
These filings contain rich textual information about:
- business operations
- risks and uncertainties
- management discussion
- regulatory disclosures
In this lab, we treat each 10-K as raw text data and apply a standard NLP pipeline to prepare it for analysis.
Connecting Google Colab to VS Code
You will use Google Colab as the execution backend while working inside VS Code.
Install the Colab Extension
Install the Google Colab extension in VS Code from one of the following:
- Visual Studio Marketplace
- Open VSX Registry
Search for “Colab” and install the official extension.
Open or Create a Notebook
In VS Code:
- Open an existing .ipynb file, or
- Create a new Jupyter Notebook
Sign In to Google
When prompted:
- Sign in using your Google account
- Authorize Colab access
Select the Colab Kernel
In the notebook interface:
- Click Select Kernel
- Choose Colab
- Select New Colab Server
Your notebook is now running on Google Colab.
Dataset Overview
- All data for this lab is located in: SEC-10K-2024/
- You will need to add the folder to your own Google Drive: right-click the folder and choose "Add shortcut to Drive". This lets you access the folder from your Drive.
- This folder contains plain-text 10-K filings for multiple publicly traded firms.
- Each file represents one company’s annual report.
When running notebooks on Google Colab, file paths such as ../data/... will not work. Colab runs on a remote virtual machine, so all data must be accessed via mounted storage or downloads.
Research Framing (Important)
You are not training a model yet. Instead, think of this lab as asking structured questions of text, such as:
- What terms dominate risk disclosures?
- How consistent is language across companies?
- Which words survive aggressive cleaning?
- How does preprocessing change the text representation?
Your answers will be supported by intermediate outputs, not final predictions.
NLP Processing Pipeline
You will implement the following pipeline step by step:
- Raw text
- Sentence segmentation
- Tokenization
- Part-of-Speech (POS) tagging
- Stop-word removal
- Stemming / Lemmatization
- Dependency parsing
- String metrics & matching
Each stage produces artifacts that help you answer analytical questions.
Load and Inspect the Data
Mount Drive in Colab
from google.colab import drive
drive.mount('/content/drive')
Verify Files
import os
os.listdir("/content/drive/MyDrive")
Adjust paths as needed.
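Once Drive is mounted, the filings can be read directly from the shortcut. A minimal sketch, assuming the shortcut was added at the top level of MyDrive (adjust DATA_DIR if you placed it elsewhere):

```python
from pathlib import Path

# Assumed location of the Drive shortcut; adjust to match your own layout.
DATA_DIR = Path("/content/drive/MyDrive/SEC-10K-2024")

if DATA_DIR.exists():
    files = sorted(DATA_DIR.glob("*.txt"))
    print(f"Number of 10-K documents: {len(files)}")
else:
    print(f"{DATA_DIR} not found - check where the shortcut was added.")
```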
For Local Drive (Optional)
from pathlib import Path
DATA_DIR = Path("../data/SEC-10K-2024")
files = list(DATA_DIR.glob("*.txt"))
print(f"Number of 10-K documents: {len(files)}")
# Read a sample document
sample_text = files[0].read_text(encoding="utf-8")
print(sample_text[:1500])

1. What sections of the 10-K appear most frequently in the opening text?
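One way to approach this question is to count standard Item headings in the opening text with a regular expression. A minimal sketch (a short stand-in string is used if `sample_text` is not already loaded):

```python
import re
from collections import Counter

# Use the loaded filing if available; otherwise fall back to a stand-in snippet.
try:
    sample_text
except NameError:
    sample_text = "Item 1. Business ... Item 1A. Risk Factors ... Item 7. Management's Discussion"

# Count occurrences of standard 10-K item headings in the opening text
item_pattern = re.compile(r"Item\s+\d+[A-C]?", re.IGNORECASE)
heading_counts = Counter(m.group(0) for m in item_pattern.finditer(sample_text[:5000]))
print(heading_counts.most_common(10))
```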
This question will help you understand the structure of the document and identify key areas for analysis (e.g., risk factors, management discussion).
Sentence Segmentation
import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(sample_text)
print(f"Number of sentences: {len(sentences)}")
sentences[:5]

3. What kinds of tokens appear that are not "words" (e.g., symbols, numbers, legal references)?
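The POS-tagging cell that follows uses a `tokens` variable that is never defined in this handout, so the tokenization step appears to have been dropped. A minimal sketch using NLTK's `word_tokenize`, with a crude regex fallback in case the punkt data cannot be downloaded (`sample_text` gets a stand-in if it is not already loaded):

```python
import re

# Use the loaded filing if available; otherwise fall back to a stand-in snippet.
try:
    sample_text
except NameError:
    sample_text = "The Company faces risks, including regulatory risks (Item 1A)."

try:
    import nltk
    nltk.download("punkt", quiet=True)
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(sample_text)
except (ImportError, LookupError):
    # Fallback: crude word/punctuation split if NLTK or its data is unavailable
    tokens = re.findall(r"\w+|[^\w\s]", sample_text)

print(f"Number of tokens: {len(tokens)}")
print(tokens[:15])
```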
nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag
pos_tags = pos_tag(tokens[:50])
pos_tags

5. Which important business terms survive stop-word removal?
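The stemming cell that follows references `filtered_tokens`, which is never defined here, so the stop-word-removal step appears to be missing. A minimal sketch using NLTK's English stopword list, with a small built-in fallback if the corpus cannot be downloaded (`tokens` gets a stand-in if absent):

```python
# Use tokens from the tokenization step if available; otherwise use a stand-in.
try:
    tokens
except NameError:
    tokens = ["The", "Company", "faces", "risks", "and", "regulatory", "uncertainty", ","]

try:
    import nltk
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Fallback list if the NLTK corpus is unavailable
    stop_words = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on", "is", "are"}

# Keep lowercase alphabetic tokens that are not stop words
filtered_tokens = [
    t.lower() for t in tokens
    if t.isalpha() and t.lower() not in stop_words
]
print(filtered_tokens[:20])
```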
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(t) for t in filtered_tokens[:20]]
lemmas = [lemmatizer.lemmatize(t) for t in filtered_tokens[:20]]
list(zip(filtered_tokens[:20], stems, lemmas))

6. Which transformation preserves interpretability better for financial text?
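To make this question concrete, compare how each transform treats a few common financial terms: the Porter stemmer often produces non-word forms, while the lemmatizer returns dictionary words. The term list below is illustrative, and the lemmatizer falls back to the raw term if the WordNet data cannot be downloaded:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Illustrative terms common in financial text
terms = ["liabilities", "regulatory", "securities", "operations"]
for t in terms:
    stem = stemmer.stem(t)
    try:
        lemma = lemmatizer.lemmatize(t)  # noun part of speech by default
    except LookupError:
        lemma = t  # fallback if WordNet is unavailable
    print(f"{t:12s} stem: {stem:12s} lemma: {lemma}")
```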
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentences[0])
[(token.text, token.dep_, token.head.text) for token in doc]

7. How might dependency relationships help identify risk statements or obligations?
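One sketch of an answer: modal auxiliaries such as "may" or "must" attach to their verb through an aux dependency, so the parse can surface obligation-like statements. The `obligation_like` helper below is illustrative, not part of the lab; it assumes en_core_web_sm is installed and falls back to a blank pipeline (which produces no parse, and therefore no matches) if the model is missing:

```python
import spacy

# Load the small English model; fall back to a blank pipeline if it is missing.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

def obligation_like(sentence):
    """Return (subjects, modal, verb) triples suggested by the dependency parse."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        # Modal auxiliaries attach to their verb via the `aux` relation
        if token.dep_ == "aux" and token.text.lower() in {"may", "must", "shall", "could"}:
            verb = token.head
            subjects = [c.text for c in verb.children if c.dep_ == "nsubj"]
            triples.append((subjects, token.text, verb.text))
    return triples

print(obligation_like("The Company may incur additional losses."))
```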
from difflib import SequenceMatcher
def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()
similarity(
"risk management strategy",
"enterprise risk management"
)

8. Why might approximate string matching be useful for cross-company comparison?
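One motivation: section titles vary across filings ("Risk Factors" vs "Item 1A. Risk Factors"), so exact matching fails where approximate matching succeeds. A small illustrative sketch (the titles are hypothetical) that maps variants to a canonical label with a similarity threshold:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Case-insensitive ratio in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical section titles drawn from different filings
titles = [
    "Risk Factors",
    "RISK FACTORS",
    "Item 1A. Risk Factors",
    "Management's Discussion and Analysis",
]

# Keep titles whose similarity to the canonical label clears a threshold
canonical = "risk factors"
matches = [t for t in titles if similarity(t, canonical) > 0.6]
print(matches)
```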
Deliverables
Submit a Word document answering the questions, along with the Jupyter notebook containing your code and outputs (either .ipynb or .pdf).
Key Takeaway
Before we can generate language, we must first discipline text into structure.
This pipeline is the foundation upon which Bag of Words, TF-IDF, embeddings, and generative models are built.